Intro to NGS processing

James A. Fellows Yates

2021-08-17

Who am I?

  • Education
    • B.Sc. Bioarchaeology (University of York, UK)
    • M.Sc. Naturwissenschaftliches Archäologie (University of Tübingen, DE)
    • Ph.D. Archaeogenetics (MPI-SHH / MPI-EVA, DE)
  • Experience
    • Number of genetics classes taken: 0
    • Number of bioinformatics classes taken: 0

@jfy133

Icons designed by OpenMoji. License: CC BY-SA 4.0

Today we will

  1. Describe basics of DNA
  2. Introduce what DNA sequencing is
  3. Explain how Illumina NGS sequencing data is generated
  4. Evaluate NGS data [Practical]

Introduction to DNA

What is DNA?

Deoxyribonucleic acid (/diːˈɒksɪˌraɪboʊnjuːˌkliːɪk, -ˌkleɪ-/; DNA) is a molecule composed of two polynucleotide chains that coil around each other to form a double helix carrying genetic instructions for the development, functioning, growth and reproduction of all known organisms and many viruses. - Wikipedia

What is DNA?

Structure ADN Zephyris, CC BY-SA 3.0 via Wikimedia Commons

What is DNA?

Structure ADN Pradana Aumars, CC BY-SA 4.0, via Wikimedia Commons

The rules

  • Four nucleotides
    • Pyrimidines: Cytosine, Thymine
    • Purines: Guanine & Adenine
  • Base pairing: one pyrimidine with one purine
    • C with G (think: CGI)
    • A with T (think: AT-AT walker)
  • Complementary
    • C on one strand, G on the other (or v.v.)
    • A on one strand, T on the other (or v.v.)
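
The pairing rules can be tried out on the command line. This is a minimal sketch, assuming a Unix-like shell with the standard `rev` and `tr` tools; the function names `complement` and `revcomp` are made up for this illustration.

```shell
# Complement: swap each base for its pairing partner (A<->T, C<->G).
complement() {
  printf '%s\n' "$1" | tr 'ACGT' 'TGCA'
}

# Reverse complement: the other strand, read in the opposite direction.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

complement AATGC   # prints TTACG
revcomp AATGC      # prints GCATT
```

Because the strands are complementary, knowing one strand is enough to write down the other.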

AT-AT Walker AT-AT Walker by Nick Bluth from the Noun Project, CC BY 3.0

How do we get DNA?

Figure 17 01 02 CNX OpenStax, CC BY 4.0, via Wikimedia Commons

What about ancient DNA?

  • Basically the same, except: aDNA molecules are degraded
    • Fragmented (short molecules)
    • Damaged (modified nucleotides)

Sequencing ancient DNA © 2015 Lucy Reading / The Scientist. All rights reserved. Used here for training purposes only.

© 2015 Lucy Reading / The Scientist. All rights reserved. Modified and used here for training purposes only.

Introduction to DNA Sequencing

What is Sequencing?

Converting the chemical nucleotides of a DNA molecule

to

ACTG on your computer screen

Historically

Sanger-sequencing Estevezj, CC BY-SA 3.0 via Wikimedia Commons

  • Sanger sequencing
    • Separate strands, add primer (starting point)
    • Add mix of nucleotides, some with special ‘terminators’
    • Pass through size-filtering, read order of terminators

Pros and cons of Sanger Sequencing

  • Pros
    • Very precise (few errors, still the ‘gold standard’)
    • Sequence long DNA molecules
  • Cons
    • Resource heavy, requiring a lot of input DNA
    • Slow: one. fragment. at. a. time.

What is NGS?

  • “Next Generation Sequencing”
    • Sequence millions and even billions of DNA reads at once!
    • via MASSIVE multiplexing!
    • Sequence lots of samples at once!
    • Fast and cheap!

Not really ‘next’ anymore, consider it more ‘second’ generation (see: Nanopore)

What is NGS?

Market leader:

Illumina HiSeq 2500 Konrad Förstner, CC0, via Wikimedia Commons

(Others: Roche 454, PacBio, IonTorrent etc.)

How does it work?

  • Basically same concept, but:
    • no size separation
    • with pretty pictures!

i.e. to a strand, attach a complementary fluorophore-modified nucleotide, (normally) one colour per base:

A, G, T, C

Fire mah lazer, and take a picture! Rinse and repeat!

How does it work?

via Gfycat

Where does this happen?

On a ‘flow cell’

Next generation sequencing slide Bronner et al. (2013) Current Protocols in Human Genetics, DOI: (10.1002/0471142905.hg1802s79)

Where does this happen?

But how do you get your DNA to attach to the lawn

(and not get lost)?

  • Convert it to a library:
    • Add adapters: bind to the ‘lawn’ of the flow cell
    • Add indexes: sample-specific barcode
    • Add priming sites: where enzymes start copying DNA

AATGATACGGCGACCACCACaccgacaaCCCTACACGACGCTCTTCCGATCTXXXXXXAGCACACGTCTGAACTCCAGTCACgacactaCCGTCTTCTGCTTG
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTACTATGCCGCTGGTGGTGtggctgttGGGATGTGCTGCGAGAAGGCTAGAXXXXXXTCGTGTGCAGACTTGAGGTCAGTGctgtgatGGCAGAAGACGAAC

[Adapter & Index Primer] [Index] [Target primer] [Target] [Target primer] [Index] [Adapter & Index Primer]

Sequencing-by-synthesis

Once bound, the fluorescence of one molecule is not enough…

Cluster Generation DMLapato, CC BY-SA 4.0, via Wikimedia Commons

  • Make lots of copies, a.k.a. clustering!
  • One cluster == many copies of one DNA molecule

Sequencing-by-synthesis

Cluster Generation Abizar Lakdawalla, CC BY 3.0, via https://openlab.citytech.cuny.edu/

  1. Add fluorescent nucleotides (complementary will bind)
  2. Wash away unbound nucleotides
  3. Fire laser & take photo
  4. Remove fluorophore
  5. Back to 1 ⤴️

What does this look like?

Cluster Generation EMBL-EBI Training, CC BY-SA 4.0, via https://www.ebi.ac.uk/training/

Improving quality

  • Over time, imaging reagents get ‘tired’ and more errors occur
    • Bases sometimes don’t bind, or several bind at once == clusters get ‘desynced’
    • Base-quality: machine calculates probability it got the ‘right’ nucleotide for each photo
    • ‘Dead’ base call: typically reported as N
  • How to improve or correct?

Improving quality

  • Improvement: paired-end sequencing
    • Get order of nucleotides by sequencing from one end
    • Get reverse order of nucleotides - sequence other end!
    • Bonus: sequence more of a molecule that is longer than the number of cycles

MiSeq™, HiSeq™ 1000/1500/2000/2500 and NovaSeq™ 6000 v1.0 reagents paired-end flow cell, © 2021 Illumina, Inc. All rights reserved. Used here for training purposes only

© 2021 Illumina, Inc. All rights reserved. Used here for training purposes only.

Photos to DNA string

  • Special software (e.g. bcl2fastq):

  • For each location on the flow cell (cluster):

    • Record the sequence of bases (from colours)
    • Calculate the probability that the ‘base call’ is correct (e.g. was the image blurry or weak?)
    • Note the index in the sequence (sample-specific barcode)
  • Group each recorded sequence, or ‘read’, with those that have the same index

    • a.k.a. demultiplexing
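
Real demultiplexing is done by software such as bcl2fastq on the machine's raw output. The grouping idea can still be sketched on a FASTQ text file, assuming the index is the last colon-separated field of each read header and each read takes up four lines. The function name `demux` and the output file naming are invented for this illustration.

```shell
# Split a FASTQ file into one file per index (sample barcode).
# Usage: demux <input.fastq> <output_prefix>
demux() {
  awk -v prefix="$2" '
    NR % 4 == 1 {                      # header line of each 4-line read
      n = split($0, parts, ":")
      index_seq = parts[n]             # last field holds the index
      out = prefix "_" index_seq ".fastq"
    }
    { print > out }                    # write all 4 lines to that index file
  ' "$1"
}
```

For example, `demux reads.fastq sample` (with `reads.fastq` as a placeholder file name) would write one `sample_<index>.fastq` file per index found in the headers.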

FASTQ File

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity. - Wikipedia

FASTQ File

Example

@K00233:37:HGHLYBBXX:3:1101:2646:1121 1:N:0:NACGCATC+NGCTAATG
NCGCATGAGCCGCCTGTATCAGGCGCTGATCGAACCGGGCATTGCAGTTGGGATAGATCGGAAGAGCACACGTCTG
+
#A7F<<AA<JFJFJJJJJJFFJJJJJJJAFFJFJJJJJJJFJAFFFJAJFJJ<FJJJJJFFF<FFA--FFFJJJJJ
@K00233:37:HGHLYBBXX:3:1101:4655:1121 1:N:0:NACGCATC+NGCTAATG
NATGCATGACAGGAGGTGAGGGCATTTTCCAGATTTTCAGGCTGCGACCTTGAGCATCTTTCGCCGCTTCCAGCAC
+
#AA-<FFFF7JFF7JJJJJFJJ<JJJJJA7FJJJJJJJFF<JFF<J7-<FJJJJFJFFJJJAAAAFFJJ--AJAJJ
@ <read id, e.g. machine ID, location on flowcell> <extra metadata>
  <DNA sequence; Note: N = base couldn't be called!>
+ <a separator>
  <base quality scores for each nucleotide in sequence>
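
Since every read takes up exactly four lines, simple line arithmetic is enough to pull a FASTQ file apart. This is a minimal sketch, assuming an uncompressed FASTQ file without line-wrapping; the function names are made up for this illustration.

```shell
# Count reads: total number of lines divided by four.
count_reads() {
  awk 'END { print NR / 4 }' "$1"
}

# Print only the DNA sequences: line 2 of every 4-line record.
get_sequences() {
  awk 'NR % 4 == 2' "$1"
}

# Count reads that contain at least one uncalled base (N).
count_reads_with_n() {
  awk 'NR % 4 == 2 && /N/ { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_reads reads.fastq` (with `reads.fastq` as a placeholder file name) would print the number of reads in that file.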

Quality score

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
0.2......................26...31........41          
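
Each quality character stands for a number: its ASCII code minus 33 gives the Phred score Q, and Q corresponds to an error probability of 10^(-Q/10). This is a minimal sketch of that decoding, assuming the Phred+33 encoding shown above; the function names are made up for this illustration.

```shell
# Convert one quality character to its Phred score (ASCII code minus 33).
phred_score() {
  code=$(printf '%d' "'$1")   # the leading ' makes printf output the character code
  echo $((code - 33))
}

# Convert a Phred score Q to the probability that the base call is wrong.
error_prob() {
  awk -v q="$1" 'BEGIN { printf "%.4f\n", exp(-q / 10 * log(10)) }'
}

phred_score '#'   # prints 2
phred_score 'J'   # prints 41
error_prob 20     # prints 0.0100
```

So the `#` at the start of both example reads marks a very low-confidence call (here the base is reported as N), while `J` is the highest score in this table.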

Things to remember

  • Adapters and indices
    • a.k.a. sequencing ‘artefacts’, not the real DNA molecule!
  • Base qualities
    • Cycle-quality decay
  • Paired-end sequencing!
    • Merge together: better confidence in base call
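
When a molecule is shorter than the read length, the machine reads on into the adapter, so adapter sequence ends up at the end of the read. This is a rough sketch of removing it, assuming the common Illumina adapter start `AGATCGGAAGAGC` (it appears in the first example read above) and no sequencing errors in the adapter. The function name `trim_adapter` is made up here; real pipelines use dedicated adapter-trimming tools instead.

```shell
# Cut a read at the first exact occurrence of the adapter start,
# dropping the adapter and everything after it.
trim_adapter() {
  printf '%s\n' "$1" | sed 's/AGATCGGAAGAGC.*$//'
}

trim_adapter GGGCATTGCAGTTGGGATAGATCGGAAGAGCACACGTCTG   # prints GGGCATTGCAGTTGGGAT
```

Reads without the adapter pass through unchanged.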

Recap

  • DNA molecules essentially:
    • Made up of nucleotides (ACTG)
    • Two strands: complementary base pairs (C-G, A-T)
    • aDNA are fragmented molecules: short
  • NGS Sequencing:
    • Massively multiplexed: millions of DNA molecules at once
    • Add adapters to bind to a glass slide
    • Make new strand, adding fluorescent nucleotides
    • Fire laser at each nucleotide and take photo
  • Results in a FASTQ file
    • Has base quality scores

Practical: Introduction to NGS data processing

Working on the command-line

What is the command line?

A command-line interface (CLI) processes commands to a computer program in the form of lines of text. - Wikipedia

  • i.e. use words, not point and click with mouse
  • Important: more efficient/scalable & more reproducible
  • Most bioinformatics work is performed via command line
    • Often because you are working on remote servers (i.e. very large computers with no screen)

Logging into a server

  1. Open browser
  2. Go to:
  3. Log-in with your credentials

The command line

A command prompt (or just prompt) is a sequence of (one or more) characters used in a command-line interface to indicate readiness to accept commands. - Wikipedia

james_fellows_yates@bionc21:~$ 
<username>@<machine_name>:<current_directory>$
  • Everything after $ is where you type your command
  • Never copy and paste the prompt!
  • ⚠️Prompts look different on different machines!

Your first command

Type in everything after the prompt, then press enter/return (⏎) on your keyboard

$ echo "Hello world!"
Hello world!
  • A command typically consists of:
    1. Program/software/tool name
    2. Arguments (e.g. input files)
    3. Options or flags (e.g. -h or --help)

Move around

What is in the room (directory)

$ ls

Let’s go into the directory and see what’s in there!

$ cd input/
$ ls -l

How to go back?

$ cd ../

Your first bioinformatic job

We will run the nf-core/eager pipeline.

nf-core/eager is a scalable and reproducible bioinformatics best-practise processing pipeline for genomic NGS sequencing data, with a focus on ancient DNA (aDNA) data. It is ideal for the (palaeo)genomic analysis of humans, animals, plants, microbes and even microbiomes.

Pipeline (software): a chain of data-processing processes or other software entities

Run your first bioinformatic job

nextflow run nf-core/eager -profile singularity,test_tsv --input input/fastqs.tsv

Check the results

What is the output?

ls
cd results/
ls multiqc/

Practical: Introduction to NGS data quality control

MultiQC Report